In this chapter, we explore various text manipulation techniques in R. We start by discussing how to handle strings, focusing on printing and combining (pasting) them. Next, we introduce the stringr package, part of the tidyverse family, whose functions make handling text data much easier than in base R. Lastly, we cover how to identify and work with patterns in text data using regular expressions.
Before we begin, let’s acquaint ourselves with data types, data structures and functions in R.
When working with numeric data, it is quite intuitive to perform operations like addition or multiplication on vectors. However, manipulating strings (or character data), one of the core data types in R, requires specific functions. String manipulation can become complex, especially when combining strings from a single vector or different columns of a data frame. With text data, we can perform tasks such as adding or replacing text, finding matches, counting letters, locating positions of specific text characters, and much more.
We can use single quotes ('') or double quotes ("") to mark a value (any value) as a string. For instance, suppose we want to print one of the best-known phrases in the Computer Science, Data Science and Data Engineering world, “Hello world!”. We can print this phrase with the print() function, enclosing the text in single or double quotes:
# Print with single quotes
print('Hello world!')
[1] "Hello world!"
# Print with double quotes
print("Hello world!")
[1] "Hello world!"
In both cases, we get exactly the same result. However, what happens if we need double or single quotes within a string? Since R cannot tell on its own which quotes belong to the text and which delimit the string, we need to mark the quotes that are part of the text. To do so, we use what is called an escape sequence: we place the special character backslash (\) before each single or double quote that we want to include in the string. To display the result, we use the cat() function. The cat() function concatenates and displays text in a way that is more suitable for string formatting, especially when working with escape sequences or when we want to print text exactly as it appears, without additional characters like quotes or backslashes. Unlike the print() function, which shows the internal representation of objects (including quotes around strings), the cat() function outputs the string as plain text.
# Print "I want to print "Hello World"" with print()
print("I want to print \"Hello World!\"")
[1] "I want to print \"Hello World!\""
# Print "I want to print "Hello World!"" with cat()
cat("I want to print \"Hello World!\"")
I want to print "Hello World!"
The print() function displays the string as "I want to print \"Hello World!\"", showing both the surrounding double quotes and the backslashes. The cat() function, however, displays I want to print "Hello World!" without the additional characters, making the output cleaner and more readable.
Using cat() is particularly useful when formatting strings that include special characters such as quotes, as it provides more control over how the output appears.
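As a small complement to the escape-sequence example, the sketch below shows two related base R behaviours: wrapping the text in the other kind of quote avoids the backslashes altogether, and cat() does not append a line break on its own. The variable name msg is only illustrative.

```r
# Double quotes inside a single-quoted string need no backslash
msg <- 'I want to print "Hello World!"'

# cat() does not add a trailing newline, so we supply "\n" ourselves
cat(msg, "\n", sep = "")

# Both ways of writing the string store exactly the same characters
identical(msg, "I want to print \"Hello World!\"")
```

The backslash is therefore only part of how the string is typed and printed, not part of the stored text.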
The paste() function can be used when we want to combine two or more string values into a single string. For instance, we can use the paste() function to print “Hello World!” when the two words are separate strings, as the example below shows.
# Print "Hello World" with paste
paste("Hello", "World!")
[1] "Hello World!"
We see that the paste() function combines the two string values into one string, separating them by a space. This occurs because the default separator of the paste() function is a space. We can use a different separator, though, by changing the argument sep. For example, suppose we want to print the text “Data-Science”.
# Print "Data-Science"
paste("Data", "Science", sep = "-")
[1] "Data-Science"
Things can become complicated when we start including whole character vectors instead of a single string value inside the paste() function. For instance, suppose we have a scalar (a one-element vector, or just a single value as before) and a vector of two elements (“Science” and “Analytics”).
# Print with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-")
[1] "Data-Science" "Data-Analytics"
When we have two vectors of the same length, vectorization takes place, as we would expect with vector inputs. This means that each element of one vector is combined with the corresponding element of the other vector.
# Print with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-")
[1] "Data-Data" "Science-Analytics"
With vectors that are not of the same length, R will recycle the shorter vector to match the length of the longer one. This means that the shorter vector is repeated until it matches the length of the longer vector, which can sometimes lead to unexpected or undesired results if not used carefully.
# Print with two vectors of different lengths
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics", "Engineering")
paste(vector1, vector2, sep = "-")
[1] "Data-Data" "Science-Analytics" "Data-Engineering"
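Because paste() recycles silently, a quick length check before pasting can help catch unintended recycling. The following is a minimal sketch of such a guard; the message text and the decision to only warn (rather than stop) are illustrative assumptions, not part of the original example.

```r
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics", "Engineering")

# Illustrative guard: tell the user when the inputs differ in length
if (length(vector1) != length(vector2)) {
  message("Lengths differ: the shorter vector will be recycled")
}

# The result is always as long as the longest input
result <- paste(vector1, vector2, sep = "-")
length(result)
# [1] 3
```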
If we want to combine all resulting elements into a single string, we can use the collapse argument, which specifies the text placed between the elements. For instance, suppose we add to the earlier examples the argument collapse with the value “ and ” (notice the spaces on both sides of the word).
# Print with a scalar and a vector
scalar <- "Data"
vector <- c("Science", "Analytics")
paste(scalar, vector, sep = "-", collapse = " and ")
[1] "Data-Science and Data-Analytics"
# Print with two vectors
vector1 <- c("Data", "Science")
vector2 <- c("Data", "Analytics")
paste(vector1, vector2, sep = "-", collapse = " and ")
[1] "Data-Data and Science-Analytics"
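The collapse argument also works on a single vector, which is a common way to turn several values into one readable string. A brief sketch, with the vector name topics chosen only for illustration:

```r
topics <- c("Science", "Analytics", "Engineering")

# sep is irrelevant here because there is only one input vector;
# collapse joins its elements into a single string
joined <- paste(topics, collapse = ", ")
joined
# [1] "Science, Analytics, Engineering"

# The result has length one, unlike the input
length(joined)
# [1] 1
```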
Lastly, a variation of the function paste() is the function paste0(). The difference between these two functions is that paste() separates the pieces of text with a space by default, while paste0() joins them with no separator at all.
# Print "Hello World" with paste
paste("Hello", "World!")
[1] "Hello World!"
# Print "Hello World" with paste0
paste0("Hello", "World!")
[1] "HelloWorld!"
When we use the paste() or the paste0() function, it is a good idea to try some prints just to make sure that the output is the expected one. We saw that when we have vectors inside the function, things can become complicated.
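One way to follow this advice is to test paste0() on a small input first. The sketch below builds hypothetical file names (the prefix and extension are made up for illustration) and confirms that paste0() behaves like paste() with an empty separator.

```r
# Numbers are converted to text and the scalars are recycled
file_names <- paste0("file_", 1:3, ".csv")
file_names
# [1] "file_1.csv" "file_2.csv" "file_3.csv"

# paste0() gives the same result as paste() with sep = ""
identical(file_names, paste("file_", 1:3, ".csv", sep = ""))
# [1] TRUE
```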
stringr Package
As mentioned at the beginning, the package we use for text manipulation is the stringr package. Although base R already provides many functions to manipulate strings, the stringr package is preferable because of its consistent syntax: its functions share the prefix str_ and take the string (or vector of strings) as their first argument.
With RStudio, this is very handy as we are not concerned about remembering all these string manipulation functions: when we type “str”, RStudio automatically shows us all available alternatives.
Let’s start by loading the stringr package and creating a vector of character values.
# Library
library(stringr)
# Quotes
quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")
With this character vector, we can experiment with the many functions of the stringr package, each with a different purpose. For instance, suppose we want to check whether the word “is” exists within each value of the vector. We can do this with the str_detect() function, since we are trying to “detect” whether a specific pattern exists within a string.
# Is the pattern in the string?
str_detect(quotes, pattern = "is")
[1] FALSE TRUE TRUE
As expected, we get the values FALSE, TRUE and TRUE because the word “is” is found within the second and the third element of the vector but not in the first one.
Another similar function is str_which(). It resembles the which() function from base R and returns the indexes of the elements in which the specified pattern exists.
# Return the indexes of entries that contain the pattern
str_which(quotes, pattern = "is")
[1] 2 3
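The relationship between str_detect(), str_which() and base R's which() can be made explicit with a short sketch. The object name hits is illustrative; the vector is the same quotes vector used in this chapter.

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Logical vector: one TRUE/FALSE per element
hits <- str_detect(quotes, pattern = "is")

# which() on the logical vector gives the same indexes as str_which()
identical(which(hits), str_which(quotes, pattern = "is"))
# [1] TRUE

# TRUE counts as 1, so sum() gives the number of matching elements
sum(hits)
# [1] 2
```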
For subsetting strings, the functions str_sub() and str_subset() can be used. The first extracts part of each string based on specified positions, while the second keeps only those strings that contain a specified pattern. The example below shows how they work and helps us understand the difference between the two.
# Extract the first 6 characters
str_sub(quotes, start = 1, end = 6)
[1] "Become" "The be" "Text m"
# Return the subset of the strings that contains the word "Master"
str_subset(quotes, pattern = "Master")
[1] "Become a Master in Data Science."
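Two further details may help separate these functions in practice: str_sub() accepts negative positions, which count from the end of the string, and str_subset() behaves like logical subsetting with str_detect(). A minimal sketch using the same quotes vector:

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# A negative start counts from the end: the last 8 characters
last_part <- str_sub(quotes[1], start = -8)
last_part
# [1] "Science."

# str_subset() keeps whole strings, like indexing with str_detect()
identical(str_subset(quotes, pattern = "Master"),
          quotes[str_detect(quotes, pattern = "Master")])
# [1] TRUE
```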
If we want to see where a specific pattern occurs within a string, the str_view() function highlights the match (if it exists, of course).
# Emphasize word "is"
str_view(quotes, pattern = "is")
[2] │ The best way to learn data science <is> to do data science.
[3] │ Text mining <is> an essential skill.
With the str_split() function, we can split each string into pieces at every occurrence of a specified pattern. The result is a list with one element per input string. In the example below, we see that the first string is left whole because it does not contain the pattern, while the second and third strings are each split into two pieces.
# Split the quotes to create a list
str_split(quotes, pattern = "is")
[[1]]
[1] "Become a Master in Data Science."
[[2]]
[1] "The best way to learn data science " " to do data science."
[[3]]
[1] "Text mining " " an essential skill."
We see that, in practice, the functions found in stringr are simple and effective. There are many other functions in stringr, and the table below gives an overview of the most commonly used ones, listing each function's name and a short description. It is advisable to come back to this table when solving a task that involves strings.
Function | Description |
---|---|
str_detect() | Is the pattern in the string? |
str_which() | Return the indexes of entries that contain the pattern |
str_sub() | Extract the characters at specified positions (e.g. from 1 to 6) |
str_subset() | Return the subset of the strings that contain the pattern (e.g. “Master”) |
str_replace() | Replace the first match of the pattern in a string with another string |
str_replace_all() | Replace all matches of the pattern in a string with another string |
str_locate() | Return the start and end positions of the first occurrence of the pattern |
str_locate_all() | Return the start and end positions of all occurrences of the pattern |
str_to_upper() | Change all characters to upper case letters |
str_to_lower() | Change all characters to lower case letters |
str_to_title() | Change the first character of each word to upper case and the rest to lower case |
str_length() | Number of characters in a string |
str_count() | Count the number of times a pattern appears in a string |
str_replace_na() | Replace all NAs with a specified value |
str_trim() | Remove white space at the start and at the end of a string |
str_sort() | Sort the vector in alphabetical order |
str_order() | Indexes that would sort the vector in alphabetical order |
str_trunc() | Truncate a string to a fixed width (the trailing dots count as 3 characters) |
str_c() | Join strings |
str_view_all() | Emphasize all the parts of a string that match the pattern |
str_split() | Split a string into pieces at each occurrence of the pattern and return a list |
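To make a few rows of the table concrete, here is a short sketch applying some of the listed functions to the quotes vector from this chapter. The object names counts, first_only and every_match are illustrative.

```r
library(stringr)

quotes <- c("Become a Master in Data Science.",
            "The best way to learn data science is to do data science.",
            "Text mining is an essential skill.")

# Count matches per element; matching is case-sensitive,
# so "Data" in the first element is not counted
counts <- str_count(quotes, pattern = "data")
counts
# [1] 0 2 0

# Replace only the first match versus every match
first_only <- str_replace(quotes[2], pattern = "data", replacement = "DATA")
every_match <- str_replace_all(quotes[2], pattern = "data", replacement = "DATA")
first_only
# [1] "The best way to learn DATA science is to do data science."
every_match
# [1] "The best way to learn DATA science is to do DATA science."
```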
In R, regular expressions are pattern-matching tools that provide a concise and flexible syntax for specifying search patterns in text data. Put simply, we use regular expressions to describe patterns in strings. To understand what this means and how we can use regular expressions, we will use the string “Data!” with the function str_detect() that we discussed previously.
# Check Regular Expression for "Data!"
str_detect("Data!", pattern = "^....!")
[1] TRUE
What exactly is this pattern? As we see, we just matched the string “Data!” using a sequence of special characters. The special character caret (^) signifies the start of a string; it does not itself stand for the first character. Then, we used the special character dot (.) 4 times because a dot represents any single character in our string. Since the word “Data” contains 4 characters, we used the dot 4 times to capture the pattern. Lastly, we included the exclamation mark (!), which is matched literally, because it appears in our string. As a result, we described the pattern of the string “Data!” and that is why we got TRUE as an output. It is important to understand that the exact same regular expression would also describe similar strings such as “Math!” or “Stat!”, as the pattern is exactly the same (4 characters, followed by an exclamation mark).
# Check Regular Expression for "Math!"
str_detect("Math!", pattern = "^....!")
[1] TRUE
# Check Regular Expression for "Stat!"
str_detect("Stat!", pattern = "^....!")
[1] TRUE
That is actually the difference between using a regular expression and using the exact value of a string as a pattern. Had we used the value “Data!” in the argument pattern, we would of course get the output TRUE in the first example, but we would get FALSE in the other two. Because our purpose is usually to capture the general pattern shared by the values of a vector, it is very useful to be able to describe such patterns.
# Check Regular Expression for "Data!" with the pattern "Data!"
str_detect("Data!", pattern = "Data!")
[1] TRUE
# Check Regular Expression for "Math!" with the pattern "Data!"
str_detect("Math!", pattern = "Data!")
[1] FALSE
# Check Regular Expression for "Stat!" with the pattern "Data!"
str_detect("Stat!", pattern = "Data!")
[1] FALSE
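One detail worth testing is how strict the pattern "^....!" really is. It anchors only the start of the string, and the dot matches any character, not just letters. The sketch below compares it with a variant that also anchors the end using the dollar sign ($); the vector name candidates and its values are made up for illustration.

```r
library(stringr)

candidates <- c("Data!", "Data! Science", "1234!", "Da!")

# Anchored at the start only: anything may follow the "!"
start_only <- str_detect(candidates, pattern = "^....!")
start_only
# [1]  TRUE  TRUE  TRUE FALSE

# Anchored at both ends: exactly 4 characters followed by "!"
both_ends <- str_detect(candidates, pattern = "^....!$")
both_ends
# [1]  TRUE FALSE  TRUE FALSE
```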
The main question that arises of course is “what is the point of learning regular expressions?”. Regular expressions are very useful when it comes to text data manipulation. For instance, suppose we have a vector that describes the body weight of 5 people.
# Create a vector
body_weight <- c("75 KG", "82 KG", "85 KG", "68 KG", "79 KG")
# Print body_weight
body_weight
[1] "75 KG" "82 KG" "85 KG" "68 KG" "79 KG"
# Print the class of the bodyweight
class(body_weight)
[1] "character"
Since we have text in this vector, the data type of our vector is “character”. In practice, though, we would probably want to separate the numeric values from the text within each element of the vector “body_weight” in order to perform analysis on the numeric values. In other words, there is no need to keep the unit “KG” in our vector. By describing the pattern in the function str_remove() from the stringr package, we can remove the unit using a regular expression. The pattern below matches a space followed by any two characters at the end of the string.
# Remove the unit from each element
body_weight <- str_remove(string = body_weight, pattern = " ..$")
# Print body_weight
body_weight
[1] "75" "82" "85" "68" "79"
# Print the class of the body_weight
class(body_weight)
[1] "character"
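Note that the class is still “character” after removing the unit. To actually analyse the values as numbers, one further step is needed. A minimal sketch with base R's as.numeric(), where the name weight_kg is an illustrative choice:

```r
# The cleaned values are still text
body_weight <- c("75", "82", "85", "68", "79")

# Convert the text to numbers
weight_kg <- as.numeric(body_weight)
class(weight_kg)
# [1] "numeric"

# Numeric operations now work as expected
mean(weight_kg)
# [1] 77.8
```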
This simple example clearly illustrates the value of regular expressions. However, regular expressions can be very confusing, especially when we are working with more complicated strings. For this reason, we will learn how to create regular expressions with the rebus package, which contains functions that help us construct regular expressions more easily, in a way closer to human language. Although not part of the tidyverse, this package combines well with the stringr package. To see how this works, let’s load the rebus package and use its syntax to describe the same string “Data!” (with the str_detect() function).
# Library
library(rebus)
# Check Regular Expression for "Data!" with a plain regular expression
str_detect("Data!", pattern = "^....!")
[1] TRUE
# Check Regular Expression for "Data!" with rebus
str_detect("Data!",
           pattern = START %R% ANY_CHAR %R% ANY_CHAR %R%
             ANY_CHAR %R% ANY_CHAR %R% "!")
[1] TRUE
We see that the pattern is much easier to read when written with the rebus package. We start the pattern with START, then we use ANY_CHAR 4 times because the word “Data” consists of 4 characters, and finally we add an exclamation mark in double quotes. The special operator %R% can be read as “followed by” or “then”. With this syntax, we describe the whole pattern of the value “Data!”. We could also describe this word in many other ways, such as the ones in the following example.
# "Data!" - the pattern is that the character value starts with any character
str_detect("Data!", pattern = START %R% ANY_CHAR)
[1] TRUE
# "Data!" - the pattern is that the character value ends with exclamation mark
str_detect("Data!", pattern = "!" %R% END)
[1] TRUE
In both cases, we see that we get the value of TRUE. It is therefore important to understand that there is no need to describe the whole pattern every time. How we will describe a pattern though really depends on the underlying data. Especially with regular expressions, it is important to practice and try to understand the output that we get when we use a specific pattern. In our previous example, if we set the pattern to the value of “START %R% one_or_more(DGT)”, we get the value of FALSE. Clearly, the reason would be that the value “Data!” does not start with one or more digits, but if we had the value “4-Data!”, we would get the value TRUE.
# Describe "Data!"
str_detect("Data!", pattern = START %R% one_or_more(DGT))
[1] FALSE
# Describe "4-Data!"
str_detect("4-Data!", pattern = START %R% one_or_more(DGT))
[1] TRUE
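The same rebus building blocks work with other stringr functions as well. The sketch below, offered as one possible usage rather than the only approach, stores the pattern in a variable, extracts the leading digits with str_extract(), and removes a trailing unit with SPC and END. The object names pattern and cleaned are illustrative.

```r
library(stringr)
library(rebus)

# Store the pattern; printing it shows the regular expression rebus built
pattern <- START %R% one_or_more(DGT)
pattern

# Extract the leading digits instead of only detecting them
leading <- str_extract("4-Data!", pattern = pattern)
leading
# [1] "4"

# Remove a whitespace character followed by "KG" at the end of the string
cleaned <- str_remove(c("75 KG", "82 KG"), pattern = SPC %R% "KG" %R% END)
cleaned
# [1] "75" "82"
```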
Now that we have discussed the intuition behind regular expressions and the rebus package, we can focus on the more technical details. The table below summarizes the syntax for different regular expressions together with the corresponding syntax of the rebus package. It is important to note that the syntax we use with rebus is NOT itself a regular expression; we use rebus to construct a regular expression in an easier way.
Regular expression | Rebus | Description |
---|---|---|
^ | START | Start of a string |
$ | END | End of a string |
. | ANY_CHAR | Any single character |
? | optional() | Optional pattern |
* | zero_or_more() | Zero or more occurrences |
+ | one_or_more() | One or more occurrences |
{} | repeated() | Repeated pattern |
\| | or() | Choice among alternatives |
[] | char_class() | Any character within a specified set |
[^] | negated_char_class() | Any character NOT in a specified set |
\^ | CARET | Caret sign |
\$ | DOLLAR | Dollar sign |
\. | DOT | Dot sign |
\d | DGT | Any digit |
\w | WRD | Any word character (letter, digit or underscore) |
\s | SPC | Any whitespace |
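As an illustration of combining entries from this table, the sketch below uses or() to accept only two specific words before the exclamation mark. It assumes that or() from rebus groups its alternatives, which is how the function is documented; the vector name words is illustrative.

```r
library(stringr)
library(rebus)

words <- c("Data!", "Math!", "Stat!")

# Start of string, then "Math" or "Stat", then "!", then end of string
pattern <- START %R% or("Math", "Stat") %R% "!" %R% END

matches <- str_detect(words, pattern = pattern)
matches
# [1] FALSE  TRUE  TRUE
```

Printing pattern at the console is a quick way to check which regular expression rebus generated before applying it to real data.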